
Initially, we were interested in how payroll caps and luxury taxes had an impact on baseball teams. The data for the MLB, however, was difficult to obtain, so we explored other professional leagues. Eventually, we settled on the NBA and asked a similar question: what factors are critical to the success of an NBA team?
Factors that we investigated include:
We sought to determine which, if any, of these variables best predict a team's success as measured by wins.
Pertinent data was pulled from these sources, compiled in Excel, and exported as a CSV file. We chose five years because this time span is long enough to capture broader trends in the NBA.
# import libraries
import csv
import pandas as pd
import string
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import requests
import json
import hvplot.pandas
import seaborn as sns
import numpy as np
from scipy.stats import linregress
import holoviews as hv
hv.extension('bokeh')
from bokeh.models import ColumnDataSource, Legend
from bokeh.plotting import figure, show
from bokeh.transform import dodge
from bokeh.io import export_png
from bokeh.io import export_svgs
from math import pi
from config import geoapify_key
# Import NBA franchise data from 2017-2021 and replace NaN with zeros
file_path = 'Resources/NBA-Complete.csv'
NBA_df = pd.read_csv(file_path)
NBA_df = NBA_df.fillna(0)
# Display df
NBA_df.head()
| | Team | Age (2017) | W (2017) | L (2017) | Pct (2017) | Attend. (2017) | Income (2017) | All Stars (2017) | Age (2018) | W (2018) | ... | Attend. (2020) | Income (2020) | All Stars (2020) | Age (2021) | W (2021) | L (2021) | Pct (2021) | Attend. (2021) | Income (2021) | All Stars (2021) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Atlanta Hawks | 27.9 | 43 | 39 | 0.524 | 654306 | 22 | 1.0 | 25.4 | 24 | ... | 545453 | 36 | 1.0 | 25.4 | 41 | 31 | 0.569 | 59288.0 | 37 | 0.0 |
| 1 | Boston Celtics | 25.9 | 53 | 29 | 0.646 | 760690 | 85 | 1.0 | 24.7 | 55 | ... | 610864 | 86 | 2.0 | 25.1 | 36 | 36 | 0.500 | 30067.0 | 46 | 2.0 |
| 2 | Brooklyn Nets | 26.0 | 20 | 62 | 0.244 | 632608 | 52 | 0.0 | 25.1 | 28 | ... | 524907 | 44 | 0.0 | 28.2 | 48 | 24 | 0.667 | 30491.0 | -80 | 2.0 |
| 3 | Charlotte Hornets | 26.5 | 36 | 46 | 0.439 | 710643 | 21 | 1.0 | 26.6 | 36 | ... | 478591 | 36 | 0.0 | 24.6 | 33 | 39 | 0.458 | 68255.0 | 34 | 0.0 |
| 4 | Chicago Bulls | 26.9 | 41 | 41 | 0.500 | 888882 | 95 | 1.0 | 24.4 | 27 | ... | 639352 | 115 | 0.0 | 25.6 | 31 | 41 | 0.431 | 13655.0 | 39 | 1.0 |
5 rows × 36 columns
x = NBA_df['Team']
y1 = NBA_df['Pct (2017)']
y2 = NBA_df['Attend. (2017)']
#figure(figsize=(300, 100), dpi=20)
plt.rc('font', size=10)
# Create the bar graph
fig, ax1 = plt.subplots()
ax1.bar(x, y1, color='y')
ax1.set_xlabel('Teams')
ax1.set_ylabel('Win %', color='k')
# Create the second set of y-axis labels
ax2 = ax1.twinx()
ax2.stem(x, y2,use_line_collection=True)
ax2.set_ylabel('Attendance', color='k')
ax1.set_xticklabels(x, rotation=90)
# Save plot in outputs folder
plt.savefig("Output/Attd0.png", dpi=300, bbox_inches='tight')
# Show the plot
plt.show()
x = NBA_df['Team']
y1 = NBA_df['Pct (2019)']
y2 = NBA_df['Attend. (2019)']
#figure(figsize=(300, 100), dpi=20)
plt.rc('font', size=10)
# Create the bar graph
fig, ax1 = plt.subplots()
ax1.bar(x, y1, color='y')
ax1.set_xlabel('Teams')
ax1.set_ylabel('Win %', color='k')
# Create the second set of y-axis labels
ax2 = ax1.twinx()
ax2.stem(x, y2,use_line_collection=True)
ax2.set_ylabel('Attendance', color='k')
ax1.set_xticklabels(x, rotation=90)
# Save plot in outputs folder
plt.savefig("Output/Attd1.png", dpi=300, bbox_inches='tight')
# Show the plot
plt.show()
sns.set_style("whitegrid")
sns.lmplot(x="Age (2021)", y="Pct (2021)", data=NBA_df, line_kws={"color":"red"})
plt.xlabel("Average Team Age (2021)", fontweight='bold')
plt.ylabel("Team Win Percentage (2021)", fontweight='bold')
plt.savefig("Output/Age0.png", dpi=300, bbox_inches='tight')
plt.show()
sns.set_style("whitegrid")
sns.lmplot(x="Age (2020)", y="Pct (2020)", data=NBA_df, line_kws={"color":"orange"})
plt.xlabel("Average Team Age (2020)", fontweight='bold')
plt.ylabel("Team Win Percentage (2020)", fontweight='bold')
plt.savefig("Output/Age1.png", dpi=300, bbox_inches='tight')
plt.show()
sns.set_style("whitegrid")
sns.lmplot(x="Age(2019)", y="Pct (2019)", data=NBA_df, line_kws={"color":"blue"})
plt.xlabel("Average Team Age (2019)", fontweight='bold')
plt.ylabel("Team Win Percentage (2019)", fontweight='bold')
plt.savefig("Output/Age2.png", dpi=300, bbox_inches='tight')
plt.show()
sns.set_style("whitegrid")
sns.lmplot(x="Age (2018)", y="Pct (2018)", data=NBA_df, line_kws={"color":"purple"})
plt.xlabel("Average Team Age (2018)", fontweight='bold')
plt.ylabel("Team Win Percentage (2018)", fontweight='bold')
plt.savefig("Output/Age3.png", dpi=300, bbox_inches='tight')
plt.show()
sns.set_style("whitegrid")
sns.lmplot(x="Age (2017)", y="Pct (2017)", data=NBA_df, line_kws={"color":"green"})
plt.xlabel("Average Team Age (2017)", fontweight='bold')
plt.ylabel("Team Win Percentage (2017)", fontweight='bold')
plt.savefig("Output/Age4.png", dpi=300, bbox_inches='tight')
plt.show()
# Show scatter chart
show(nba_scatter)
# Print the r-value for each year
print(f'The r-value is: {rvalue}.')
print(f'The 2017 r-value is: {rvalue1}.')
print(f'The 2018 r-value is: {rvalue2}.')
print(f'The 2019 r-value is: {rvalue3}.')
print(f'The 2020 r-value is: {rvalue4}.')
print(f'The 2021 r-value is: {rvalue5}.')
The r-value is: 0.25013462921759505.
The 2017 r-value is: 0.34129414919782575.
The 2018 r-value is: 0.6360615671322686.
The 2019 r-value is: 0.3987123814842942.
The 2020 r-value is: 0.4003994838684177.
The 2021 r-value is: 0.4203519954833007.
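The cell that computes `rvalue` through `rvalue5` is not shown in this notebook. As a minimal sketch, assuming the values come from `scipy.stats.linregress` (which is imported above), an r-value could be obtained as follows. The choice of columns (age against win percentage) and the five-row sample copied from the `head()` output are assumptions for illustration only; the variables actually regressed may differ.

```python
import numpy as np
from scipy.stats import linregress

# Five teams from the head() output: Age (2017) and Pct (2017)
age_2017 = [27.9, 25.9, 26.0, 26.5, 26.9]
pct_2017 = [0.524, 0.646, 0.244, 0.439, 0.500]

# linregress returns slope, intercept, rvalue, pvalue and stderr
result = linregress(age_2017, pct_2017)
rvalue = result.rvalue

# Cross-check: the r-value is the Pearson correlation coefficient
pearson_r = np.corrcoef(age_2017, pct_2017)[0, 1]

print(f"The r-value is: {rvalue}.")
```

Because the r-value is the Pearson correlation, it always lies between -1 and 1, and its sign gives the direction of the linear relationship.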
# Show chart
show(nba)
We seek to determine whether the size of a franchise's media market has any impact on its success. We define success as either a winning record or profitability.
Maps are created to visualize the data, and statistical tests are done to make conclusions.
Finally, a correlation heatmap is created with all variables of interest in the study. A summary and final thoughts are presented.
The data consists of the main dataset compiled by our team as well as media market data obtained from hoop-social.com. An API search is also done to pull the geocoordinates of each team's arena.
Additional variables are calculated for use in summarizing our findings.
# Create a function to remove punctuation from columns in a dataframe
def remove_punctuation(input_string):
    # Make a translation table to remove all punctuation characters
    translator = str.maketrans('', '', string.punctuation)
    # Use translate method to remove all punctuation characters
    no_punct = input_string.translate(translator)
    return no_punct
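As a quick illustration of how this helper behaves, the sketch below applies it to two arena names. The raw sample strings are assumptions, chosen because they clean to names that appear in the geocoding log ("Cryptocom Arena", "ATT Center"); the original raw values are not shown in the notebook.

```python
import string

def remove_punctuation(input_string):
    # Translation table that deletes every character in string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    return input_string.translate(translator)

# Hypothetical raw arena names (assumed, not taken from the source data)
samples = ["Crypto.com Arena", "AT&T Center"]
cleaned = [remove_punctuation(name) for name in samples]
print(cleaned)
```

On a DataFrame column, the same function could be applied element-wise with `Series.apply(remove_punctuation)`.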
# Pull arena latitude and longitude from geoapify app
for index, row in full_NBA_df.iterrows():
    arena = row["Arena"]
    target_url = f"https://api.geoapify.com/v1/geocode/search?text={arena}&format=json&apiKey={geoapify_key}"
    geo_data = requests.get(target_url).json()
    try:
        full_NBA_df.loc[index, "Arena Lat"] = geo_data["results"][0]['lat']
        full_NBA_df.loc[index, "Arena Lon"] = geo_data["results"][0]['lon']
        print(f"Coordinates found for {arena}")
    except (KeyError, IndexError):
        # No usable result was returned for this arena
        print('Could not find coordinates')
Coordinates found for State Farm Arena
Coordinates found for TD Garden
Coordinates found for Barclays Center
Coordinates found for Spectrum Center
Coordinates found for United Center
Coordinates found for Rocket Mortgage Fieldhouse
Coordinates found for American Airlines Center
Coordinates found for Ball Arena
Coordinates found for Little Caesars Arena
Coordinates found for Chase Center
Coordinates found for Toyota Center
Coordinates found for Gainbridge Fieldhouse
Coordinates found for Cryptocom Arena
Coordinates found for Cryptocom Arena
Coordinates found for FedEx Forum
Coordinates found for FTX Arena
Coordinates found for Fiserv Forum
Coordinates found for Target Center
Coordinates found for Smoothie King Center
Coordinates found for Madison Square Garden IV
Coordinates found for Paycom Center
Coordinates found for Amway Center
Coordinates found for Wells Fargo Center
Coordinates found for Phoenix Suns Arena
Coordinates found for Moda Center
Coordinates found for Golden 1 Center
Coordinates found for ATT Center
Coordinates found for Scotiabank Arena
Coordinates found for Vivint Smart Home Arena
Coordinates found for Capital One Arena
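The lookup loop depends on the shape of the JSON returned by the geocoding service. The sketch below isolates that parsing step in a small helper so it can be checked without a network call. The helper name `extract_coords` and the sample payloads (with dummy coordinates) are hypothetical; only the `results[0]['lat']` and `results[0]['lon']` access pattern comes from the notebook.

```python
def extract_coords(geo_data):
    """Return (lat, lon) from a geocoding response dict, or None if absent."""
    try:
        first = geo_data["results"][0]
        return first["lat"], first["lon"]
    except (KeyError, IndexError):
        # No "results" key, an empty result list, or missing coordinates
        return None

# Hypothetical payloads shaped like the fields the notebook reads
found = {"results": [{"lat": 1.5, "lon": -2.5}]}
empty = {"results": []}

print(extract_coords(found))
print(extract_coords(empty))
```

Catching only `KeyError` and `IndexError` keeps unrelated failures, such as a network error or an invalid API key, from being silently reported as a missing arena.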
# Add column for average win
Pct_df = NBA_df[["Team", "Pct (2017)", "Pct (2018)", "Pct (2019)", "Pct (2020)", "Pct (2021)"]]
full_NBA_df['mean Pct'] = Pct_df.iloc[:, 1:].mean(axis=1)
# Add column for average age
Age_df = NBA_df[["Team", "Age (2017)", "Age (2018)", "Age(2019)", "Age (2020)", "Age (2021)"]]
full_NBA_df['mean Age'] = Age_df.iloc[:, 1:].mean(axis=1)
# Add column for average income
inc_df = NBA_df[["Team", "Income (2017)", "Income (2018)", "Income (2019)", "Income (2020)", "Income (2021)"]]
full_NBA_df['mean Income'] = inc_df.iloc[:, 1:].mean(axis=1)
# Add column for average attendance
Att_df = NBA_df[["Team", "Attend. (2017)", "Attend. (2018)", "Attend. (2019)", "Attend. (2020)"]]
full_NBA_df['mean Attendance'] = Att_df.iloc[:, 1:].mean(axis=1)
# Add column for average payroll
pay_df = full_NBA_df[["Team", "Payroll (2017)", "Payroll (2018)", "Payroll (2019)", "Payroll (2020)","Payroll (2021)"]]
full_NBA_df['mean Payroll'] = pay_df.iloc[:, 1:].mean(axis=1)
# Add column for average rank
rk_df = full_NBA_df[["Team", "Rk (2017)", "Rk (2018)", "Rk (2019)", "Rk (2020)", "Rk (2021)"]]
full_NBA_df['mean Rank'] = rk_df.iloc[:, 1:].mean(axis=1)
# Calculate total wins over the five year sample period
wins_df = full_NBA_df[["Team", "W (2017)", "W (2018)", "W (2019)", "W (2020)", "W (2021)"]]
full_NBA_df['Total Wins'] = wins_df.iloc[:, 1:].sum(axis=1)
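The per-team averages and totals above are all row-wise aggregations over one column per season. As a compact sketch of the same idea, the year columns can be selected by name pattern instead of being listed by hand. The two-team frame below is hypothetical and only mirrors the notebook's column naming; `DataFrame.filter` is a swapped-in alternative, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical two-team frame using the notebook's column naming pattern
df = pd.DataFrame({
    "Team": ["Team A", "Team B"],
    "Pct (2017)": [0.500, 0.250],
    "Pct (2018)": [0.600, 0.750],
    "W (2017)": [41, 20],
    "W (2018)": [49, 62],
})

# Select every column whose name starts with "Pct (" and average across each row
df["mean Pct"] = df.filter(regex=r"^Pct \(").mean(axis=1)

# Same idea with a row-wise sum for total wins
df["Total Wins"] = df.filter(regex=r"^W \(").sum(axis=1)

print(df[["Team", "mean Pct", "Total Wins"]])
```

A pattern-based selection only works when the column names are consistent. In this dataset the 2019 age column is spelled `Age(2019)` without a space, so a pattern such as `^Age \(` would miss it.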
# Create Categories for metro size
bins = [0, 2500000, 5000000, 20000000]
labels = ['Small(<2.5M)', 'Medium(2.5M-5M)', 'Large(>5M)']
full_NBA_df['Metro Categories'] = pd.cut(full_NBA_df['Metro Population'], bins=bins, labels=labels, right=False)
# Create Categories for win count
bins = [130, 190, 250]
labels = ['Below Average', 'Above Average']
full_NBA_df['Win Categories'] = pd.cut(full_NBA_df['Total Wins'], bins=bins, labels=labels, right=False)
# Create Categories for income
bins = [10, 43, 150]
labels = ['Below Median', 'Above Median']
full_NBA_df['Income Categories'] = pd.cut(full_NBA_df['mean Income'], bins=bins, labels=labels, right=False)
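To make the binning behavior concrete, the sketch below runs `pd.cut` with the metro-size bins on a few hypothetical populations placed on and around the bin edges. The population values are assumptions chosen for illustration, not figures from the dataset.

```python
import pandas as pd

bins = [0, 2500000, 5000000, 20000000]
labels = ['Small(<2.5M)', 'Medium(2.5M-5M)', 'Large(>5M)']

# Hypothetical populations chosen to sit on and around the bin edges
populations = pd.Series([2499999, 2500000, 5000000, 20000000])

# right=False makes each bin left-closed: [0, 2.5M), [2.5M, 5M), [5M, 20M)
categories = pd.cut(populations, bins=bins, labels=labels, right=False)

print(categories.tolist())
```

With `right=False`, a value exactly on an edge falls into the higher bin, and any value outside the outermost edges (here 20,000,000) becomes `NaN` and would be dropped from a later crosstab. Both crosstabs in this notebook sum to 30 teams, which suggests no team fell outside the chosen bins.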
# Configure the map plot showing metro pop by size and wins by color
wins_map = full_NBA_df.hvplot.points(
    "Arena Lon",
    "Arena Lat",
    geo = True,
    tiles = "CartoLight",
    frame_width = 700,
    frame_height = 500,
    size = "Metro Population",
    scale = 0.01,
    color = "Total Wins",
    hover_cols = ["Team"],
    clabel = 'Total Win Count',
    title = 'NBA Teams by Win Count and Market Size'
)
# Save to output folder
hvplot.save(wins_map, 'Output/NBAmapWins.html')
# Show map
wins_map
# Configure the map plot showing metro pop by size and mean income by color
income_map = full_NBA_df.hvplot.points(
    "Arena Lon",
    "Arena Lat",
    geo = True,
    tiles = "CartoLight",
    frame_width = 700,
    frame_height = 500,
    size = "Metro Population",
    scale = 0.008,
    color = "mean Income",
    hover_cols = ["Team"],
    clabel = 'Mean Income (Millions)',
    title = 'NBA Teams by Mean Income and Market Size'
)
# Save to output folder
hvplot.save(income_map, 'Output/NBAmapIncome.html')
# Show map
income_map
From an initial glance at the first map, the number of wins does not seem to be correlated with the market size of a franchise.
In the second map, a franchise's mean income does appear to be correlated with its market size.
from scipy.stats import chi2_contingency
# Chi Square for Metro Size versus Income
# Create a contingency table
contingency_table = pd.crosstab(full_NBA_df['Metro Categories'], full_NBA_df['Income Categories'])
print(contingency_table)
# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)
print("----------------------------------------------------")
# Check the p-value to determine the significance of the test
if p < 0.05:
    print("Reject the null hypothesis - The variables are dependent")
else:
    print("Fail to reject the null hypothesis - The variables are independent")
print(f"The p-value is {p}.")
Income Categories  Below Median  Above Median
Metro Categories
Small(<2.5M)                  7             2
Medium(2.5M-5M)               5             4
Large(>5M)                    2            10
----------------------------------------------------
Reject the null hypothesis - The variables are dependent
The p-value is 0.017205950425851393.
# Chi Square for Metro Size versus Wins
# Create a contingency table
contingency_table = pd.crosstab(full_NBA_df['Metro Categories'], full_NBA_df['Win Categories'])
print(contingency_table)
# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)
print("----------------------------------------------------")
# Check the p-value to determine the significance of the test
if p < 0.05:
    print("Reject the null hypothesis - The variables are dependent")
else:
    print("Fail to reject the null hypothesis - The variables are independent")
print(f"The p-value is {p}.")
Win Categories    Below Average  Above Average
Metro Categories
Small(<2.5M)                  4              5
Medium(2.5M-5M)               5              4
Large(>5M)                    6              6
----------------------------------------------------
Fail to reject the null hypothesis - The variables are independent
The p-value is 0.8948393168143698.
The variables "Metro Population", "mean Income", and "Total Wins" were discretized into categories using the pd.cut method, and chi-square analysis was run to test our initial observations.
For market size and total wins, the chi-square test fails to reject the null hypothesis of independence (p ≈ 0.89). We therefore find no evidence that the size of a market affects a team's ability to win.
However, market size and mean income are dependent (p ≈ 0.017). Teams in larger media markets tend to be more profitable.
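To make the two test results concrete, the sketch below recomputes the chi-square statistic by hand from the observed counts printed in the crosstabs. It is a worked illustration rather than a replacement for `chi2_contingency`. The helper name `chi_square_independence` is hypothetical, and the closed-form p-value relies on the fact that both tables have exactly 2 degrees of freedom.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic, degrees of freedom, and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected

# Observed counts copied from the two crosstab outputs
income_table = [[7, 2], [5, 4], [2, 10]]
wins_table = [[4, 5], [5, 4], [6, 6]]

chi2_income, dof_income, expected_income = chi_square_independence(income_table)
chi2_wins, dof_wins, expected_wins = chi_square_independence(wins_table)

# With exactly 2 degrees of freedom the chi-square survival function
# has the closed form exp(-x / 2), so no SciPy call is needed here.
p_income = math.exp(-chi2_income / 2)
p_wins = math.exp(-chi2_wins / 2)

print(f"income: chi2={chi2_income:.3f}, dof={dof_income}, p={p_income:.4f}")
print(f"wins:   chi2={chi2_wins:.3f}, dof={dof_wins}, p={p_wins:.4f}")
```

The recomputed p-values agree with the ones reported by `chi2_contingency`. Several expected counts are below 5 (for example 4.2 and 4.8 in the income table), so with only 30 teams the chi-square approximation is rough and the results are best read as indicative.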
# Create new df with all variables of interest
avgs_df = full_NBA_df[["Total Wins", 'mean Attendance','mean Payroll','mean Income','Metro Population','mean Age']]
# Create a mask for the upper triangle so that values are only shown once
mask = np.zeros_like(avgs_df.corr(), dtype=bool)
mask[np.triu_indices_from(mask,k=1)] = True
# Create heatmap
sns.heatmap(avgs_df.corr(), annot=True, cmap='coolwarm',mask=mask)
# Save to output folder
plt.savefig("Output/NBAheatmap.png", dpi=300, bbox_inches='tight')
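As a small illustration of how the heatmap mask is built, the sketch below constructs the same kind of mask for a 3x3 matrix. The correlation values are hypothetical stand-ins for `avgs_df.corr()`.

```python
import numpy as np

# Stand-in for avgs_df.corr(): a hypothetical 3x3 correlation matrix
corr = np.array([
    [1.0, 0.4, -0.1],
    [0.4, 1.0, 0.3],
    [-0.1, 0.3, 1.0],
])

# Start with nothing masked, then hide cells strictly above the diagonal
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask, k=1)] = True

print(mask)
```

Cells where the mask is `True` are hidden by `sns.heatmap`, so each pairwise correlation is shown once and the diagonal stays visible. The built-in `bool` is used for `dtype` because the `np.bool` alias was deprecated in NumPy 1.20 and removed in 1.24.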
Our correlation matrix displays how all variables of interest are related to each other.
Attendance at games does not seem to impact a team's ability to win.
Player age actually has a positive, moderate relationship with a team's ability to win. More aggregate experience among players seems to lead to success.
A positive, moderate relationship exists for payroll and team wins. Better players will cost a premium.
A team's income, however, has little to do with a winning record.
A team's market size also has little to do with a team's ability to win. In fact, many teams in smaller metro areas have top-tier records. The relationship is negative but weak.
The market size of a team does impact a franchise's ability to generate income. Teams in larger metropolitan areas are more successful from a financial standpoint.
Experience seems to be an important factor in an NBA team's ability to succeed as the relationship between wins and mean age was the strongest in our study. This experience, though, will come with a cost since high player payrolls are also correlated with wins.
From a financial perspective, a winning team does not necessarily translate into large profits. Market size and average attendance do more to predict a larger income than a team's win record. For an owner, the ability to generate income depends more on marketability.